Transcriptome characterization of Larrea tridentata and identification of genes associated with phenylpropanoid metabolic pathways

Larrea tridentata (Sesse and Moc. ex DC.) Coville (family: Zygophyllaceae) is an aromatic evergreen shrub with resin-covered leaves, known to use in traditional medicine for diverse ailments. It also has immense pharmacological significance due to presence of powerful phenylpropanoids antioxidant, nordihydroguaiaretic acid (NDGA). The RNA sequence/transcriptome analyses connect the genomic information into the discovery of gene function. Hence, the acquaint analysis of L. tridentata is in lieu to characterize the transcriptome, and to identify the candidate genes involved in the phenylpropanoid biosynthetic pathway. To gain molecular insight, the bioinformatics analysis of transcriptome was performed. The total bases covered 48,630 contigs of length greater than 200 bp and above came out to 21,590,549 with an average GC content of 45% and an abundance of mononucleotide, SSR, including C3H, FAR1, and MADS transcription gene families. The best enzyme commission (EC) classification obtained from the assembled sequences represented major abundant enzyme classes e.g., RING-type E3 ubiquitin transferase and non-specific serine/ threonine protein kinase. The KEGG pathway analysis mapped into 377 KEGG different metabolic pathways. The enrichment of phenylpropanoid biosynthesis pathways (22 genes i.e., phenylalanine ammonia-lyase, trans-cinnamate 4-monooxygenase, 4-coumarate-CoA ligase, cinnamoyl-CoA reductase, beta-glucosidase, shikimate O-hydroxycinnamoyl transferase, 5-O-(4-coumaroyl)-D-quinate 3’-monooxygenase, cinnamyl-alcohol dehydrogenase, peroxidase, coniferyl-alcohol glucosyltransferase, caffeoyl shikimate esterase, caffeoyl-CoA O-methyltransferase, caffeate O-methyltransferase, coniferyl-aldehyde dehydrogenase, feruloyl-CoA 6-hydroxylase, and ferulate-5-hydroxylase), and expression profile indicated antioxidant, anti-arthritic, and anticancer properties of L. tridentata. The present PLOS ONE PLOS ONE | https://doi.org/10.1371/journal.pone.0265231 March 11, 2022 1 / 13 a1111111111 a1111111111 a1111111111 a1111111111 a1111111111

The information content of an organism is recorded in the DNA of its genome and expressed through the process called transcription. Transcriptome, the entire pool of transcripts in an organism or single cell at certain physiological or pathological stage, is indispensable in unravelling the connection and regulation between DNA and protein; thus, a transcriptome captures a snapshot in time of the total transcripts present in a cell [12]. The next generation RNA sequencing (RNA seq) has evolved as one of the most widely used techniques for cost-effective and massive amounts of high quality gene expression data within a shorter time [12][13][14] in the absence of a reference genome [15][16][17]. The RNA seq/transcriptome analyses connect the genomic information into the discovery of gene function [18]. During the last decade, the transcriptome analyses have propelled the understanding of genomic information, regulatory mechanisms of the genome, and their biological implications [19] e.g., metabolic pathway [20][21][22], comparative transcriptomics [23] and evolutionary genomics [14,24]. In recent years, the characterization of the transcriptome of medicinal plants has widely been studied to discover the secondary metabolic pathways and the related genes responsible for the production of effective natural products required for further pharmaceutical research [25] and metabolic engineering [26].
Hence, the acquaint transcriptome analysis of pharmacologically worthy L. tridentata is in lieu to characterize the transcriptome, and to identify candidate genes involved in the phenylpropanoid biosynthetic pathway.

de novo assembly
The RNA transcriptome SRA data of L. tridentata available from 'One Thousand Plant Transcriptomes Initiative' [24] were retrieved. The de novo assembly of L. tridentata transcriptome of a total number of 60,02,560 good quality reads out of 75,12,845 total reads was optimized after assessing the effect of various k-mer (17, 21, 23, 25, 27, 31 and 35) lengths. The high-quality trimmed reads were assembled using the SOAPdenovo program [27]. The total number of contigs, contigs with length of 200 bp and above, N50 value, longest contig length, and average contig length as a function of k-mer, were analyzed. The NCBI NR non-redundant protein database was used for similarity search and annotation of the assembled transcripts and extracted the best hit with another taxon. The protein sequences of Arabidopsis thaliana and L. tridentata were used in OrthoFinder [28] to find orthologous genes in A. thaliana and L. tridentata using the reciprocal blast alignment algorithm [28]. The results from OrthoFinder were used to identify A. thaliana genes best matched with L. tridentata. Further, collinear gene pairs between L. tridentata and A. thaliana were generated using the McScanX toolkit [29]. Those genes in A. thaliana that had the best match with L. tridentata were further visualized for their synteny using TBtools [30].

GC content analysis, identification of simple sequence repeats (SSRs), and transcription factor families (TFFs)
The GC content analysis was performed using in-house developed R script. The MISA-web (http://pgrc.ipk-gatersleben.de/misa/) was used to identify the SSRs in the unigenes [31]. For identification of the transcription factor families (TFFs) represented in L. tridentata transcriptome, the transcripts were searched for homology against all the transcription factor protein sequences at PlnTFDB (plant transcription factor database) using BLASTX.

GO analysis and search of the KEGG pathway
The transcripts were assigned to GO terms to describe the functions of genes using Blast2GO (https://www.blast2go.com/), and associated gene products were subjected to KEGG pathway search (http://www.genome.jp/kegg). The distribution of the KEGG ortholog genes involved in the pathways of interest from L. tridentata was compared with the available transcriptomes from the plants belonging to Zygophyllaceae family using KAAS [32] server (https://www. genome.jp/kegg/kaas/). Based on the count of ortholog genes involved in the pathways, Zscore was calculated, and further visualized as a heatmap using the R function.

Results and discussion
de-novo assembly de novo assembly of L. tridentata transcriptome was optimized after assessing the effect of various k-mer lengths. The k-mer size of 23 emerged as the best size for assembly with N50 length of 1,226 bp, largest contig length of 6,380 bp, average contig length of 499 bp, and transcripts with 16,527 ORF (average ORF length 810.6 bp). A total of 48,630 contigs having length of at least 200 bp were generated. These contigs made the final representatives of assembled sequences for further analyses (Fig 1). The total bases covered by contigs with length greater than 200 bp and above came out to 21,590,549.

GC content, SSRs and TFFs
The pronounced variation in GC content of the angiosperm plays a vital role in gene regulation and in determining the physical properties of the genome, and possesses deep ecological relevance [33,34]. The average GC content of L. tridentata transcripts was 45% (S3 Fig in S1 File) which is in range with the GC levels of coding sequences of angiosperms [34].
By sequence comparison with known transcription factor gene families, 4034 putative L. tridentata transcription factor genes, distributed in at least 79 families (S3 Table in S2 File) were identified (Fig 3). These genes covered transcription factor gene families (TFFs), such as C3H, FAR1, MADS, MYB-related, PHD, bHLH, NAC, C2H2, SET, SNF2, HB, WRKY, Orphans, FHA, AUX/IAA, AP2-EREBP, bZIP and many more. These TFFs have been associated with varied processes. Among all these TF gene families, C3H, FAR1, and MADS were the most abundant families (S3 Table in S2 File). Members of the C3H family are involved in embryogenesis [37]. FAR1 is the positive regulator of chlorophyll biosynthesis via activation of HEMB1 gene expression [38]. MADS contributes to the development of petals, stamens, and carpels [39]. MYB and bZIP TFFs insinuate the regulation of stress responses [40]. The members of PHD TFF are involved in vernalization processes [41]. The bHLH members are involved in controlling cell proliferation and the development of specific cell lineages [42].

Functional annotation and classification of transcriptome
The transcripts were assigned to the GO terms in order to describe the function of genes and associated gene products into three major categories namely, biological process, molecular function, and cellular component, including their sub-categories [43]. These genes were further classified into three major categories namely, biological process, molecular function, and cellular component using plant specific GO slims that broadly provide an overview of the ontology content. The functional classification of L. tridentata transcripts in biological process category (Fig 4) showed that metabolic process of nitrogen compounds (GO: 0006807) and response to stimulus (GO: 0050896) were among the highly represented groups. In the cellular component group, sequences related to the organelle part (GO:0044422) and intracellular organelle part (GO: 0044446) were well represented categories (Fig 4). Transcripts belonging to major subgroups of the molecular function categories included protein binding (GO: 0005515), organic cyclic compound binding (GO: 0097159) and heterocyclic compound binding (GO: 1901363) (Fig 4). These GO annotations provided comprehensive information on L. tridentata expressed genes that are encoding proteins (S4 Table in  The best EC classification obtained from assembled sequences annotated 1,069 enzyme codes (S6 Table in
The WRKY gene family plays a vital role in plant development and environment response. WRKY transcription factors have diverse biological functions in plants, but most notably are key players in plant responses to biotic and abiotic stresses [56]. L. tridentata encodes for a WRKY gene (K18834, 1 unigene) that further encodes for the abscisic acid signalling pathway [56] as found in the present transcriptome study (Table 1).

Distribution of L. tridentata genes involved in the biosynthesis of phenylpropanoid and flavone and comparison with T. eichlerianus and K. lanceolata
The distribution of KEGG ortholog genes involved in the three main pathways of interest e.g., (phenylpropanoid biosynthesis, flavone biosynthesis, and abscisic acid production pathway) in L. tridentata was checked in other available transcriptomes of the plants belonging to the Zygophyllaceae family, namely, K. lanceolata and T. eichlerianus. Based on the count of genes, the Z-score was calculated, and the heatmap was generated to visualize the Z-score distribution amongst the plants. It is evident from the heatmap (Fig 7) that a few genes such as K00430 were involved in the phenylpropanoid biosynthesis pathway, and were present at high

PLOS ONE
amounts in L. tridentata. Further, it is also interesting to note that the gene K00487 was present only in T. eichlerianus but was absent in L. tridentata and K. lanceolata.

Conclusions
To sum up, in the present in silico investigation, an attempt was made to characterize the transcriptome of L. tridentata. The functional enrichment analysis showed that at least 6,208 genes might participate in many important biological and metabolic pathways, including phenylpropanoid biosynthesis. The transcriptome characterization in general, and the identification of various transcripts involved in the synthesis of phenylpropanoid biosynthesis pathways in particular could be extended to comparative omics and in harnessing the medicinal properties of L. tridentata through genetic engineering.